By Brian Keegan, Ph.D. -- October 4, 2014
Released under a CC-BY-SA 3.0 License.
Import the libraries we'll use throughout the analysis right away.
In [3]:
# Standard packages for data analysis
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# pandas handles tabular data
import pandas as pd
# networkx handles network data
import networkx as nx
# json handles reading and writing JSON data
import json
# To visualize webpages within this webpage
from IPython.display import HTML
# To run queries against MediaWiki APIs
from wikitools import wiki, api
# Some other helper functions
from collections import Counter
from operator import itemgetter
We are going to make a query using the list=users parameter. The api.php page contains documentation for all the different queries you can run against Wikipedia's MediaWiki API. For our first test query, we want information about a single user: search that page for "list=users". You can also find similar information about this specific query here in the general MediaWiki documentation.
We can actually write a test query as a URL, which will return results if the parameters are all valid. Using the example given on the api.php documentation page:
http://en.wikipedia.org/w/api.php?action=query&list=users&ususers=Madcoverboy|Jimbo_Wales&usprop=blockinfo|groups|editcount|registration|gender
There are four parameters in this API call, separated by & signs:
- action - We pass a query option here to differentiate it from other actions we can run on the API, like parse. But action=query will be what we use much of the time.
- list - This is one of several parameters we can use to make a query; search for "action=query" for others besides list. We pass a users option to list because we want to generate information about users. This lets us run the sub-options detailed in the documentation below.
- ususers - Here we list the names of the Wikipedia users we want to get information about. We can pass more than one name by adding a pipe "|" between names. The documentation says we can only pass up to 50 names per request. Here we pass two names: Madcoverboy for yours truly and Jimbo_Wales for the founder of Wikipedia.
- usprop - Here we pass a list of options, detailed under list=users, about the information we can obtain about any user. Again we use pipes to connect multiple options together. We are going to get information about whether a user is currently blocked (blockinfo), what powers the user has (groups), their total number of edits (editcount), the date and time they registered their account (registration), and their self-reported gender (gender).
In summary, this API request is going to perform a query action that expects us to pass a list of user names and will return information about those users. We have given the query the names of the users we want information about as well as the specific types of information we want about each of them.
The code block below shows what clicking the URL should return.
In [27]:
HTML('http://en.wikipedia.org/w/api.php?action=query&list=users&ususers=Madcoverboy|Jimbo_Wales&usprop=blockinfo|groups|editcount|registration|gender')
Out[27]:
There's a lot of padding and fields from the XML markup this returns by default, but the data are all in there. My userid is "304994", my username is "Madcoverboy" (which we already knew), I have 12,348 edits, I registered my account on June 21, 2005 at 1:52:16pm GMT, I identify as male, and I'm a member of four "groups" corresponding to my editing privileges: reviewer, *, user, and autoconfirmed.
Clicking on the link will run the query and return the results in your web browser. However, the point of using an API is not for you to make queries with a URL and then copy-paste the results into Python. We're going to run the query within Python (rather than the web browser) and return the data back to us in a format that we can continue to use for analysis.
First we need to write a function that will accept something that corresponds to the query we want to run, goes out and connects to the English Wikipedia's MediaWiki API "spigot", formats our query for this API to understand, runs the query until all the results come back, and then returns the results to us as some data object. The function below does all of those things, but it's best to just treat it as a black box for now that accepts queries and spits out the results from the English Wikipedia.
(If you want to use another MediaWiki API, replace the current URL following site_url with the corresponding API location. For example, Memory Alpha's is http://en.memory-alpha.org/api.php)
In [5]:
def wikipedia_query(query_params,site_url='http://en.wikipedia.org/w/api.php'):
    site = wiki.Wiki(url=site_url)
    request = api.APIRequest(site, query_params)
    result = request.query()
    return result[query_params['action']]
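As an aside, the site_url keyword argument above is all you would change to talk to a different wiki. Purely as a hedged illustration (meta=siteinfo is a generic MediaWiki query module, so no user names are needed), the same helper could ask Memory Alpha for its basic site information:
# Illustration only: point the helper at Memory Alpha's MediaWiki API
ma_info = wikipedia_query({'action': 'query',
                           'meta': 'siteinfo',
                           'siprop': 'general'},
                          site_url='http://en.memory-alpha.org/api.php')
# ma_info['general']['sitename'] should report the wiki's name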
We can write the exact same query as we used above by putting all the same request parameters into a dictionary as key-value pairs and saving the dictionary as user_query. For example, where we used action=query in the URL above, we use 'action':'query' as a key-value pair of strings (make sure to include the quotes marking these as strings rather than variables!) in the query dictionary. Then we can pass this query dictionary to the wikipedia_query black box function defined above to get the exact same information out. We save the output in query_results and can look at the results by calling this variable.
In [6]:
user_query = {'action':'query',
'list':'users',
'usprop':'blockinfo|groups|editcount|registration|gender',
'ususers':'Madcoverboy|Jimbo Wales'}
query_results = wikipedia_query(user_query)
query_results
Out[6]:
The data structure that is returned is a dictionary keyed by 'users'
which returns a list of dictionaries. Knowing that the data corresponding to Jimbo Wales
is the second element in the list of dictionaries (remember Python indices start at 0, so the 2nd element corresponds to 1), we can access his edit count.
In [7]:
query_results['users'][1]['editcount']
Out[7]:
Instead of writing each query manually, we can define a function get_user_properties that accepts a user name (or names) and returns the results of the query used above, replacing "Madcoverboy" and "Jimbo Wales" with the user name(s) passed.
In [8]:
def get_user_properties(user):
    result = wikipedia_query({'action':'query',
                              'list':'users',
                              'usprop':'blockinfo|groups|editcount|registration|gender',
                              'ususers':user})
    return result
We can test this function on another user, "Koavf", who is the most active user on the English Wikipedia. We'll save his results to koavf_query_results.
In [9]:
koavf_query_results = get_user_properties('Koavf')
koavf_query_results
Out[9]:
All the data we've collected in query_results
and koavf_query_results
exist only in memory. Once we shut this notebook down, these data will cease to exist. So we'll want to save these data to disk by "serializing" into a format that other programs can use. Two very common file formats are JavaScript Object Notation (JSON) and Comma-separated Values (CSV). JSON is better for more complex data that contains a mixture of strings, arrays (lists), dictionaries, and booleans while CSV is better for "flatter" data that you might want to read into a spreadsheet.
We can save koavf_query_results as a JSON file by creating and opening a file named koavf_query_results.json and referring to this connection as f. We use the json.dump function to translate all the data in the koavf_query_results dictionary into the file, and once the with block finishes, the file is automatically closed so that other programs can access it.
In [10]:
with open('koavf_query_results.json','wb') as f:
    json.dump(koavf_query_results,f)
Check to make sure this data was properly exported by reading it back in as loaded_koavf_query_results
.
In [11]:
with open('koavf_query_results.json','rb') as f:
    loaded_koavf_query_results = json.load(f)
loaded_koavf_query_results
Out[11]:
The query_results data has two "observations" corresponding to "Madcoverboy" and "Jimbo Wales". We could create a CSV with the columns corresponding to the field names (editcount, gender, groups, name, registration, userid) and then two rows containing the corresponding values for each user.
Using a powerful library called "pandas" (short for "panel data", not the cute bears), we can pass the list of data inside query_results and pandas will attempt to convert it to a tabular format called a DataFrame that can be exported to CSV. We save this as df and then use the to_csv function to write this DataFrame to a CSV file. We use two extra options: declaring a quote character to make sure the data in groups, which already contains commas, doesn't get split up later, and index=False because we don't care about exporting the row numbers (the index).
In [12]:
query_results['users']
Out[12]:
In [13]:
df = pd.DataFrame(query_results['users'])
df.to_csv('query_results.csv',quotechar='"',index=False)
df
Out[13]:
Check to make sure this data was properly exported by reading it back in.
In [14]:
pd.read_csv('query_results.csv',quotechar='"')
Out[14]:
In this section, we've covered the basics of how to:
- write a MediaWiki API query both as a URL and as a Python dictionary
- run queries from Python with the wikipedia_query function and navigate the data they return
- save the results to disk as JSON and CSV files
In the next sections, we'll use other queries to get more interesting data about relationships and more advanced data manipulation techniques to prepare these data for social network analysis.
We are going to use the prop=links query to identify the list of articles that are currently linked from an article. We will use the article for "Hillary Rodham Clinton". The general MediaWiki documentation for this query is here. We will specify a query using action=query to define the general class of query, prop=links to indicate we want the current links from a page, and titles=Hillary Rodham Clinton to pass the name of the page.
There are many "namespaces" of Wikipedia pages that reflect different kinds of pages: articles, article talk pages, user pages, user talk pages, and other administrative pages. Links to and from a Wikipedia article can come from all of these namespaces, but because the Wikipedia articles that 99% of us ever read live inside the "0" namespace, we'll limit ourselves to links in that namespace rather than these "backchannel" links. We enforce this limit with the plnamespace=0 option.
There could potentially be hundreds of links from a single article, but the API will only return some number per request. The wikitools library takes care of automatically generating additional requests if there is more data to obtain after the first request. Ideally, we could specify a large number like 10,000 to make sure we get all the links with a single request, but the API enforces a limit of 500 links per request and defaults to only 10 per request. We use pllimit=500 to make sure we get the maximum number of links per request instead of issuing 50 requests.
In [16]:
outlink_query = {'action': 'query',
'prop': 'links',
'titles': 'Hillary Rodham Clinton',
'pllimit': '500',
'plnamespace':'0'}
hrc_outlink_data = wikipedia_query(outlink_query)
In [78]:
hrc_outlink_data['pages'][u'5043192']['links'][:5]
Out[78]:
The data returned by this query is a dictionary of dictionaries that you'll need to dive "into" more deeply to access the data itself. The top dictionary contains a single key 'pages'
which returns a dictionary containing a single key u'5043192'
corresponding to the page ID for the article. Once you're inside this dictionary, you can access the list of links, which are unfortunately a list of dictionaries! Using something called a "list comprehension", I can clean this data up to get a nice concise list of links, which we save as hrc_outlink_list
. I also print out the number of links in this list and 10 examples of these links.
In [19]:
hrc_outlink_list = [link['title'] for link in hrc_outlink_data['pages'][u'5043192']['links']]
print "There are {0} links from the Hillary Rodham Clinton article".format(len(hrc_outlink_list))
hrc_outlink_list[:10]
Out[19]:
Note that there is an article for "Hillary Clinton" as well, but this article is a redirect. In other words, this article exists and has data that can be accessed from the API, but it's suspiciously sparse and just points to "Hillary Rodham Clinton".
In [20]:
outlink_query_hc = {'action': 'query',
'prop': 'links',
'titles': 'Hillary Clinton',
'pllimit': '500',
'plnamespace': '0'}
hc_outlink_data = wikipedia_query(outlink_query_hc)
hc_outlink_data
Out[20]:
The MediaWiki API has a redirects
option that lets us ignore these placeholder redirect pages and will follow the redirect to take us to the intended page. Adding this option to the query but specifying the same Hillary Clinton
value for the titles
parameter that previously led to a redirect now returns all the data at the "Hillary Rodham Clinton" article. We'll make sure to use this redirects
option in future queries.
In [21]:
outlink_query_hc_redirect = {'action': 'query',
'prop': 'links',
'titles': 'Hillary Clinton', # still "Hillary Clinton"
'pllimit': '500',
'plnamespace': '0',
'redirects': 'True'} # redirects parameter added
hcr_outlink_data = wikipedia_query(outlink_query_hc_redirect)
hcr_outlink_list = [link['title'] for link in hcr_outlink_data['pages'][u'5043192']['links']]
print "There are {0} links from the Hillary Clinton article".format(len(hcr_outlink_list))
hcr_outlink_list[:10]
Out[21]:
We are going to use the prop=linkshere query to identify the articles that currently link to the Hillary Rodham Clinton article. The parameters for this query are a bit different. We still limit ourselves to pages in the article namespace by specifying lhnamespace=0, and we maximize the number of links returned per request by specifying lhlimit=500. However, we don't want to include redirects that point to this article (e.g., "Hillary Clinton" points to "Hillary Rodham Clinton"), so we specify lhshow=!redirect. Finally, we only want the names of the linking articles rather than less important information like "pageid" or "redirects", so we limit the output by specifying lhprop=title.
In [22]:
inlink_query_hrc = {'action': 'query',
'redirects': 'True',
'prop': 'linkshere',
'titles': 'Hillary Rodham Clinton',
'lhlimit': '500',
'lhnamespace': '0',
'lhshow': '!redirect',
'lhprop': 'title'}
hrc_inlink_data = wikipedia_query(inlink_query_hrc)
Again some data processing and cleanup is necessary to drill down into the dictionaries of dictionaries to extract the list of links from the data returned by the query. I use a similar list comprehension as above to get this list of links out. Again, I count the number of links in this list and give an example of 10 links.
In [24]:
hrc_inlink_list = [link['title'] for link in hrc_inlink_data['pages'][u'5043192']['linkshere']]
print "There are {0} links to the Hillary Rodham Clinton article".format(len(hrc_inlink_list))
hrc_inlink_list[:10]
Out[24]:
In the previous two sections, we came up with two separate queries to get both the links from an article and the links to an article. However, much to the credit of the MediaWiki API engineers, you can combine both queries into one. We'll need all the same parameter information that we had included before (pllimit
, lhlimit
, etc.), but we can combine the queries together by combining prop=links
and prop=linkshere
with a pipe (like we did with user names in the very first query), prop=links|linkshere
.
In [25]:
alllinks_query_hrc = {'action': 'query',
'redirects': 'True',
'prop': 'links|linkshere', #combined both prop calls with a pipe
'titles': 'Hillary Rodham Clinton',
'pllimit': '500', #still need the "prop=links" "pl" parameters and below
'plnamespace': '0',
'lhlimit': '500', #still need the "prop=linkshere" "lh" parameters and below
'lhnamespace': '0',
'lhshow': '!redirect',
'lhprop': 'title'}
hrc_alllink_data = wikipedia_query(alllinks_query_hrc)
Again, we need to do some data processing and cleanup to get the lists of links out. However, there are now two different sub-dictionaries within the hrc_alllink_data object, reflecting the output from the links and linkshere calls.
In [26]:
hrc_alllink_outlist = [link['title'] for link in hrc_alllink_data['pages'][u'5043192']['links']]
hrc_alllink_inlist = [link['title'] for link in hrc_alllink_data['pages'][u'5043192']['linkshere']]
print "There are {0} out links from and {1} in links to the Hillary Rodham Clinton article".format(len(hrc_alllink_outlist),len(hrc_alllink_inlist))
We can also write a function get_article_links
that takes an article name as an input and returns the lists containing all the in and out links for that article. We use the combined query described above, but replace Hillary's article title with a generic article
variable, run the query, pull out the page_id
, and then do the data processing and cleanup to produce a list of outlinks and a list of inlinks, both of which are passed back out of the function. Again, this query will only pull out the current links on the article, not historical links.
In [27]:
def get_article_links(article):
    query = {'action': 'query',
             'redirects': 'True',
             'prop': 'links|linkshere',
             'titles': article, # the article variable is passed into here
             'pllimit': '500',
             'plnamespace': '0',
             'lhlimit': '500',
             'lhnamespace': '0',
             'lhshow': '!redirect',
             'lhprop': 'title'}
    results = wikipedia_query(query) # do the query
    page_id = results['pages'].keys()[0] # get the page_id
    if 'links' in results['pages'][page_id].keys(): #sometimes there are no links
        outlist = [link['title'] for link in results['pages'][page_id]['links']] # clean up outlinks
    else:
        outlist = [] # return empty list if no outlinks
    if 'linkshere' in results['pages'][page_id].keys(): #sometimes there are no links
        inlist = [link['title'] for link in results['pages'][page_id]['linkshere']] # clean up inlinks
    else:
        inlist = [] # return empty list if no inlinks
    return outlist,inlist
We can test this on Bill Clinton's article, for example.
In [28]:
bc_out, bc_in = get_article_links("Bill Clinton")
print "There are {0} out links from and {1} in links to the Bill Clinton article".format(len(bc_out),len(bc_in))
Let's put the data for both of these queries into a dictionary called clinton_link_data so it's easier to access and save. We'll also save this data to disk as JSON so we can access it in the future.
In [30]:
clinton_link_data = {"Hillary Rodham Clinton": {"In": hrc_alllink_inlist,
"Out": hrc_alllink_outlist},
"Bill Clinton": {"In": bc_in,
"Out": bc_out}
}
with open('clinton_link_data.json','wb') as f:
    json.dump(clinton_link_data,f)
Having collected data about the neighboring articles that are linked to or from one article, we can turn these data into a network. Using the NetworkX
library (shortened to nx
on import at the top), we will create a DiGraph
object called hrc_g
and then fill it with the connection data we just collected. We do this by iterating over the lists of links (hrc_alllink_outlist
and hrc_alllink_inlist
) and adding a directed edge between each neighbor and the original article. It's important to pay attention to edge direction as the out links should start at "Hillary Rodham Clinton" and end at the neighboring article whereas the in links should start at the neighboring article and end at "Hillary Rodham Clinton".
In [31]:
hrc_alllink_outlist[:5]
Out[31]:
In [32]:
hrc_g = nx.DiGraph()
for article in hrc_alllink_outlist:
    hrc_g.add_edge("Hillary Rodham Clinton",article)
for article in hrc_alllink_inlist:
    hrc_g.add_edge(article,"Hillary Rodham Clinton")
We can compute some basic statistics about the network such as the number of nodes.
In [34]:
len(hrc_alllink_outlist) + len(hrc_alllink_inlist)
Out[34]:
In [33]:
hrc_g.number_of_nodes()
Out[33]:
In [122]:
print "There are {0} edges and {1} nodes in the network".format(hrc_g.number_of_edges(), hrc_g.number_of_nodes())
We might also ask how many of these hyperlink edges are reciprocated, or link in both directions. We start with an empty container reciprocal_edges that we'll fill with edges that are reciprocated. Next, we iterate through all the edges in the graph (hrc_g.edges() returns a list of all edges) and check two things. The first check is whether the graph contains an edge that goes in the opposite direction: given an edge (i,j), we check if there's also a (j,i). The second check is to make sure we haven't already added this edge to the reciprocal_edges list. If both conditions are true, then we add the edge to reciprocal_edges.
In [35]:
reciprocal_edges = list()
for (i,j) in hrc_g.edges():
    if hrc_g.has_edge(j,i) and (j,i) not in reciprocal_edges:
        reciprocal_edges.append((i,j))
reciprocation_fraction = round(float(len(reciprocal_edges))/hrc_g.number_of_edges(),3)
print "There are {0} reciprocated edges out of {1} edges in the network, giving a reciprocation fraction of {2}.".format(len(reciprocal_edges),hrc_g.number_of_edges(),reciprocation_fraction)
We can compare this to the network for Bill Clinton. There are many more edges in his network, but a much smaller fraction of these edges are reciprocated. This suggests that there are fewer articles expressing some similarity or relationship with Bill Clinton that his article also acknowledges by linking back, which in turn invites questions about why the two articles' linking behaviors differ. With the query we've covered above, you can begin to answer these open questions.
In [36]:
bc_g = nx.DiGraph()
for article in bc_out:
    bc_g.add_edge("Bill Clinton",article)
for article in bc_in:
    bc_g.add_edge(article,"Bill Clinton")
bc_reciprocal_edges = list()
for (i,j) in bc_g.edges():
    if bc_g.has_edge(j,i) and (j,i) not in bc_reciprocal_edges:
        bc_reciprocal_edges.append((i,j))
bc_reciprocation_fraction = round(float(len(bc_reciprocal_edges))/bc_g.number_of_edges(),3)
print "There are {0} reciprocated edges out of {1} edges in the network, giving a reciprocation fraction of {2}.".format(len(bc_reciprocal_edges),bc_g.number_of_edges(),bc_reciprocation_fraction)
This is a pretty basic "star"-shaped network that contains Hillary's article at the center, surrounded by all the articles linking to and from it. In particular, we could "snowball" out from the articles that link to and are linked from a given page, visit each of those articles, and create their local networks. We could continue to do this until we traverse the whole hyperlink network, but that would take a very long time, involve a lot of data, and would be an abusive use of the API (if you want the whole hyperlink network, you can download the data directly here by clicking a backup date and searching for "Wiki page-to-page link records.").
We could also create the "1.5-step ego" hyperlink network around a given page that consists of the focal article, all the articles that link to or from it, and then whether these neighboring articles are linked to each other. This could provide a better picture of which neighboring articles link to which other articles.
Unfortunately, even the scrape for the 2-step ego hyperlink network could take over an hour of data collection and generate hundreds of megabytes of data. Furthermore, Wikipedia articles also contain templates, which create lots of "redundant" links between articles that share templates even though these links don't appear in the body of the article itself. You'd need to do much more advanced parsing of the wiki-markup to get only the links that appear in the body of an article, but that's beyond the scope of the present tutorial.
I don't recommend crawling more than the immediate (1-step) neighbors of Wikipedia articles.
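That said, if you did want the 1.5-step ego network described above, a sketch using the get_article_links function from earlier might look like the following. It is left unexecuted here because it issues a pair of API requests for every one of the thousands of neighboring articles:
# Sketch only: 1.5-step ego network around the focal article
focal = u'Hillary Rodham Clinton'
neighbors = set(hrc_alllink_outlist) | set(hrc_alllink_inlist)
ego_g = nx.DiGraph()
for article in hrc_alllink_outlist:
    ego_g.add_edge(focal,article)
for article in hrc_alllink_inlist:
    ego_g.add_edge(article,focal)
for article in neighbors:
    out_links, in_links = get_article_links(article) # slow: queries every neighbor
    for target in out_links:
        if target in neighbors: # keep only links among the neighbors
            ego_g.add_edge(article,target)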
The queries above only looked at the links coming from the current version of the article. However Wikipedia archives every version of the article, so we can rewind the tape all the way back to the first version of Hillary's article back in 2001, a few months after Wikipedia was created. Specific versions of a Wikipedia article are identified with a revid
, which is also called an oldid
in some contexts. In subsequent sections, we'll go into more detail on how to get a list of all revisions to an article and find the oldest revision. But for the time being, just trust me that revid
"256189" is the oldest version of the Hillary Rodham Clinton article. Take a peek at what the article looked like back then below:
In [37]:
HTML('<iframe src=https://en.wikipedia.org/w/index.php?title=Hillary_Rodham_Clinton&oldid=256189&useformat=mobile width=700 height=350></iframe>')
Out[37]:
The MediaWiki API allows us to extract the out links from this old version of the article. Here we'll perform a different kind of action on the API than the query action we've used so far. The action=parse will extract information, such as the links, from a given version of an article. We specify that links should be parsed out with the prop=links parameter. Finally, we pass oldid=256189 so that this specific revision is parsed.
In [38]:
oldest_outlinks_query_hrc = {'action': 'parse', #query changes to parse
'prop': 'links',
'oldid': '256189'}
oldest_outlinks_data = wikipedia_query(oldest_outlinks_query_hrc)
oldest_outlinks_data
Out[38]:
Again, data processing and cleanup using a list comprehension is necessary to get a list of links from this result.
In [86]:
oldest_outlink_list = [link['*'] for link in oldest_outlinks_data['links']]
print "There are {0} out links from the Hillary Rodham Clinton article".format(len(oldest_outlink_list))
oldest_outlink_list
Out[86]:
So now we can also extract links from historical versions of the article. However, it's much more difficult to get the history of what links in to an article (e.g., linkshere
) as this would require potentially looking at the history of every other article to check if a link was ever made from that article to another article. This is not impossible, just very very time-consuming.
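To make the cost concrete, here is a hedged sketch of the check you would have to run against every revision of every candidate article, reusing the parse query from above (revision_links_to is a name introduced here just for illustration):
# Sketch: does a given old revision of some other article link to the target?
def revision_links_to(oldid, target=u'Hillary Rodham Clinton'):
    parsed = wikipedia_query({'action': 'parse',
                              'prop': 'links',
                              'oldid': str(oldid)})
    return target in [link['*'] for link in parsed['links']]
# e.g. revision_links_to(256189) runs the check on the oldest HRC revision itself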
In this section we learned to write and combine queries that get the links to and from the current version of an article, clean the output of these queries into lists of links, use these lists to build a network object, and do some preliminary analysis of an article's ego network. There are some limitations on the specificity of the links that the API passes back, which limits our ability to generate more complex networks using this query. We also showed that it's possible to get the out links from a historical version of an article using a new kind of API action called parse. Using the out links from all the changes to an article could let us look at the evolution of what the article linked to over time. We'll go into how to get all the changes to an article in the next section.
The previous section showed how to make a basic network from the current hyperlinks to and from a Wikipedia article. It also alluded to the fact that Wikipedia captures the history of every change made to the article since it was created as well as who made these changes and when (among other meta-data). In this section, we'll explore some queries around how to extract the "revision history" of an article from the API. We'll do some exploratory analysis using these data to understand patterns in the distribution of editors' activity, changes in content, and the persistence of revisions. Additionally, we'll construct a co-authorship network of what editors made a change to the article.
Starting with a basic query, we'll get every change that's been made to the "Hillary Rodham Clinton" article. We'll use action=query and prop=revisions to get the list of changes to an article (see the detailed documentation here). There are many options to specify. We pass several options to rvprop to get the revision IDs, timestamp, user, user ID, revision comment, and the size of the article; rvlimit=500 so each request returns the maximum number of revisions (wikitools keeps requesting until all of them come back); and "newer" to rvdir so the revisions come back in chronological order (oldest to newest). There are many other options that could be specified, such as rvprop=content to get the content of each revision, rvstart and rvend to get revisions within a specific timeframe, or rvexcludeuser to omit changes from bots, for example.
In [135]:
revisions_query_hrc = {'action': 'query',
'redirects': 'True',
'prop': 'revisions',
'titles': "Hillary Rodham Clinton",
'rvprop': 'ids|user|timestamp|userid|comment|size',
'rvlimit': '500',
'rvdir': 'newer'}
revisions_data_hrc = wikipedia_query(revisions_query_hrc)
There's a lot of data in there, and you can already expect that we'll need to do some data processing and cleaning to get it into a more usable form:
- Extract the list of revisions and convert it into a DataFrame object that we'll call hrc_rv_df, adding a page column to make it clear which article is being edited.
- The values in the timestamp column of this new DataFrame are still strings rather than meaningful dates that we can sort on, so we convert them using the to_datetime function, passing the strftime formatting magic so that pandas knows which string sequences correspond to meaningful years, months, days, hours, minutes, and seconds values.
- The anon column has a strange mixture of NaNs and empty strings corresponding to whether the revision was made by a registered account or not. The replace method swaps the NaNs out for False and the empty strings for True booleans to make this more interpretable.
- Sort the rows on their timestamp values, reset the index (row numbers) so they correspond to the revision count, label this index as "revision", and set a (page, revision) MultiIndex.
- Save the data to disk as hrc_revisions.csv, making sure that we encode non-ASCII characters as "utf8". You'll want to make a habit out of doing this.
In [271]:
# Extract and convert to DataFrame
hrc_rv_df = pd.DataFrame(revisions_data_hrc['pages']['5043192']['revisions'])
# Make it clear what's being edited
hrc_rv_df['page'] = [u'Hillary Rodham Clinton']*len(hrc_rv_df)
# Clean up timestamps
hrc_rv_df['timestamp'] = pd.to_datetime(hrc_rv_df['timestamp'],format="%Y-%m-%dT%H:%M:%SZ",unit='s')
# Clean up anon column
hrc_rv_df = hrc_rv_df.replace({'anon':{np.nan:False,u'':True}})
# Sort the data on timestamp and reset the index
hrc_rv_df = hrc_rv_df.sort('timestamp').reset_index(drop=True)
hrc_rv_df.index.name = 'revision'
hrc_rv_df = hrc_rv_df.reset_index()
# Set the index to a MultiIndex
hrc_rv_df.set_index(['page','revision'],inplace=True)
# Save the data to disk
hrc_rv_df.to_csv('hrc_revisions.csv',encoding='utf8')
# Show the first 5 rows
hrc_rv_df.head()
Out[271]:
We might be interested in looking at the most active editors over the history of the article. We can perform a groupby operation that effectively creates a mini-DataFrame for each user's revisions. We use the aggregate function to collect information (len gets us the number of revisions they made) across all these mini-DataFrames, which returns a Series object indexed by username containing each user's number of revisions. Sorting these counts in descending order and looking at the top contributors shows variation across nearly two orders of magnitude.
In [272]:
hrc_rv_gb_user = hrc_rv_df.groupby('user')
hrc_user_revisions = hrc_rv_gb_user['revid'].aggregate(len).sort(ascending=False,inplace=False)
print "There are {0} unique users who have made a contribution to the article.".format(len(hrc_user_revisions))
hrc_user_revisions.head(10)
Out[272]:
Given the wide variation in the number of contributions per user, we can create a kind of "histogram" that plots how many users made how many revisions. Because there is so much variation in the data, we use logged axes. In the upper left, there are several thousand editors who made only a single contribution. In the lower right are the individual editors listed above who made several hundred revisions each to this article.
In [273]:
revisions_counter = Counter(hrc_user_revisions.values)
plt.scatter(revisions_counter.keys(),revisions_counter.values(),s=50)
plt.ylabel('Number of users',fontsize=15)
plt.xlabel('Number of revisions',fontsize=15)
plt.yscale('log')
plt.xscale('log')
We'll add some information to the DataFrame about the cumulative number of unique users who've ever edited the article. This should give us a sense of how the size of the collaboration changed over time. We start with two empty lists: unique_users, to which we add the name of each user the first time they make an edit, and unique_count, which records the number of unique users seen at each point in time. We then add the unique_count list to the DataFrame as the unique_users column.
In [307]:
def count_unique_users(user_series):
    unique_users = []
    unique_count = []
    for user in user_series.values:
        if user not in unique_users:
            unique_users.append(user)
            unique_count.append(len(unique_users))
        else:
            unique_count.append(unique_count[-1])
    return unique_count
hrc_rv_df['unique_users'] = count_unique_users(hrc_rv_df['user'])
We can look at changes to the contribution patterns on the article over time. First we need to do some data processing to convert the timestamps into generic dates. Then we group the activity together by date and use aggregate to create a new DataFrame called activity_by_day that contains the number of unique users and the number of revisions made on each day. Finally, we plot the distribution of this activity over time.
Looking at the blue line for the number of unique users, we see the collaboration is initially small through 2004, but then between 2005 and 2008 it undergoes rapid growth from a few hundred editors to over 3,000 editors. After 2008, however, the number of new users grows much more slowly and steadily. This is somewhat surprising, as this timeframe includes a number of historic events like Hillary's campaign for president in 2008 as well as her tenure as Secretary of State.
Looking at the green line for the number of revisions made per day, there is a lot of variation in daily editing activity, but much of it again occurs between 2005 and 2009 and slows down substantially thereafter. Peaks might correspond to major news events (like nominations) or to edit wars (editors fighting over content).
In [275]:
hrc_rv_df['date'] = hrc_rv_df['timestamp'].apply(lambda x:x.date())
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,'revid':len})
ax = activity_by_day.plot(lw=1,secondary_y=['revid'])
ax.set_xlabel('Time',fontsize=15)
Out[275]:
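To follow up on those peaks, one option is to pull out the busiest editing days and compare the dates against the news cycle. A small sketch using the same sorting pattern as earlier:
# The ten days with the most revisions -- candidates for news events or edit wars
activity_by_day['revid'].sort(ascending=False,inplace=False).head(10)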
We can also look at the distribution in changes to the article's size. In other words, how much content (in bytes) was introduced or removed from the article by an editor's changes? We see there is a very wide (axes are still on log scales) and mostly symmetrical distribution in additions and removals of content. In other words, the most frequent changes are extremely minor (-1 to 1 bytes) and the biggest changes (dozens of kilobytes) are very rare --- and likely the result of vandalism and reversion of vandalism. Nevertheless it's the case that this Wikipedia article's history is as much about the removal of content as it is about the addition of content.
In [276]:
hrc_rv_df['diff'] = hrc_rv_df['size'].diff()
diff_counter = Counter(hrc_rv_df['diff'].values)
plt.scatter(diff_counter.keys(),diff_counter.values(),s=50,alpha=.1)
plt.xlabel('Difference (bytes)',fontsize=15)
plt.ylabel('Number of revisions',fontsize=15)
plt.yscale('log')
plt.xscale('symlog')
Re-compute the activity_by_day DataFrame to include the diff variable computed above, using the np.median method to get the median change in the article on a given day. Substantively, this means we can track how much content was added or removed on each day. This signal is noisy, so we smooth it using rolling_mean with a 60-day window. There's a general tendency for the article to grow on any given day, but there are a few time periods when the article shrinks drastically, likely reflecting sections of the article being split out into sub-articles.
In [277]:
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,
'revid':len,
'diff':np.median})
# Compute a 60-day rolling average to remove spikiness, plot
pd.rolling_mean(activity_by_day['diff'],60).plot()
plt.yscale('symlog')
plt.xlabel('Time',fontsize=15)
plt.ylabel('Difference (bytes)',fontsize=15)
plt.axhline(0,lw=2,c='k')
Out[277]:
We can also explore how long an edit persists on the article before another edit is subsequently made. The average edit only persists for ~34,500 seconds (~9.5 hours) but the median edit only persists for 881 seconds (~15 minutes).
In [278]:
# The diff returns timedeltas, but dividing by a 1-second timedelta returns a float
# Round these numbers off to smooth out the distribution and add 1 second to everything to make the plot behave
hrc_rv_df['latency'] = [round(i/np.timedelta64(1,'s'),-1) + 1 for i in hrc_rv_df['timestamp'].diff().values]
diff_counter = Counter(hrc_rv_df['latency'].values)
plt.scatter(diff_counter.keys(),diff_counter.values(),s=50,alpha=.1)
plt.xlabel('Latency time (seconds)',fontsize=15)
plt.ylabel('Number of changes',fontsize=15)
plt.yscale('log')
plt.xscale('log')
In [279]:
hrc_rv_df['latency'].describe()
Out[279]:
As we did above, we can recompute activity_by_day
to include daily median changes in the latency between edits. There is substantial variation in how long edits persist. Again, the pre-2006 era is marked by content that goes days or weeks without changes, but between 2006 and 2009 the time between edits becomes much shorter, presumably corresponding with the attention around her presidential campaign. After 2008, the time between changes increases again and stabilizes at its (smoothed) current value of around 2 days between edits.
In [280]:
activity_by_day = hrc_rv_df.groupby('date').aggregate({'unique_users':max,
'revid':len,
'diff':np.median,
'latency':np.median})
# Compute a 60-day rolling average to remove spikiness, plot
pd.rolling_mean(activity_by_day['latency'],60).plot()
plt.yscale('symlog')
plt.xlabel('Time',fontsize=15)
plt.ylabel('Latency time (seconds)',fontsize=15)
Out[280]:
We previously created a directed network of hyperlinks where the nodes were all articles and the edges indicated the direction of the link(s) between the central article and its neighbors. In this section, we're going to construct a different kind of network that contains a mixture of editors and articles, where the edges indicate whether an editor contributed to an article. For simplicity's sake, we're going to start with the 1-step ego co-authorship network containing the "Hillary Rodham Clinton" article and the set of editors who have ever made changes to it. Because there are two types of nodes in this network (articles and editors), and editors can't edit editors and articles can't edit articles, we call this network a "bipartite network" (also known as an "affiliation" or "two-mode" network).
Even though bipartite networks are traditionally undirected, we're going to use a directed network because NetworkX
does some wacky things when using an undirected network with bipartite properties. We're also going to make this a weighted network where the edges have values that correspond to the number of times an editor made a change to the article. This basically replicates the analysis we did above in "User Activity" but is an example of the information from the revision history that we might want to include in the network representation.
We go over every user in the user
column inside hrc_rv_df
and first check whether or not a (user
,"Hillary Rodham Clinton") edge exists. If one already exists, then we increment its weight
attribute by 1. Otherwise if there is no such edge in the network, we add a (user
,"Hillary Rodham Clinton") edge with a weight
of 1. We can inspect five of the edges to make sure this worked.
In [289]:
hrc_bg = nx.DiGraph()
for user in hrc_rv_df['user'].values:
    if hrc_bg.has_edge(user,u'Hillary Rodham Clinton'):
        hrc_bg[user][u'Hillary Rodham Clinton']['weight'] += 1
    else:
        hrc_bg.add_edge(user,u'Hillary Rodham Clinton',weight=1)
print "There are {0} nodes and {1} edges in the network.".format(hrc_bg.number_of_nodes(),hrc_bg.number_of_edges())
hrc_bg.edges(data=True)[:5]
Out[289]:
Based on everything we did in the previous analysis to query the revisions, reshape and clean up the data, and extract new features for analysis, we are now going to write a big function that does all of this automatically. The function get_revision_df
will accept an article name, perform the query, and proceed to do many of the steps outlined above, and returns a cleaned DataFrame at the end.
In [39]:
def get_revision_df(article):
    revisions_query = {'action': 'query',
                       'redirects': 'True',
                       'prop': 'revisions',
                       'titles': article,
                       'rvprop': 'ids|user|timestamp|userid|comment|size',
                       'rvlimit': '500',
                       'rvdir': 'newer'}
    revisions_data = wikipedia_query(revisions_query)
    page_id = revisions_data['pages'].keys()[0]
    # Extract and convert to DataFrame. Try/except for links to pages that don't exist
    try:
        df = pd.DataFrame(revisions_data['pages'][page_id]['revisions'])
    except KeyError:
        print u"{0} doesn't exist!".format(article)
        return None # bail out rather than continuing with an undefined df
    # Make it clear what's being edited
    df['page'] = [article]*len(df)
    # Clean up timestamps
    df['timestamp'] = pd.to_datetime(df['timestamp'],format="%Y-%m-%dT%H:%M:%SZ",unit='s')
    # Clean up anon column. If/else for articles that have all non-anon editors
    if 'anon' in df.columns:
        df = df.replace({'anon':{np.nan:False,u'':True}})
    else:
        df['anon'] = [False] * len(df)
    # Sort the data on timestamp and reset the index
    df = df.sort('timestamp').reset_index(drop=True)
    df.index.name = 'revision'
    df = df.reset_index()
    # Set the index to a MultiIndex
    df.set_index(['page','revision'],inplace=True)
    # Compute additional features
    df['date'] = df['timestamp'].apply(lambda x:x.date())
    df['diff'] = df['size'].diff()
    df['unique_users'] = count_unique_users(df['user'])
    df['latency'] = [round(i/np.timedelta64(1,'s'),-1) + 1 for i in df['timestamp'].diff().values]
    # Don't return random other columns
    df = df[[u'anon',u'comment',u'parentid',
             u'revid',u'size',u'timestamp',
             u'user',u'userid',u'unique_users',
             u'date', u'diff', u'latency']]
    return df
Try this out on "Bill Clinton".
In [79]:
bc_rv_df = get_revision_df("Bill Clinton")
bc_rv_df.head()
We've created a DataFrame for both Hillary's revision history (hrc_rv_df) as well as Bill's revision history (bc_rv_df). We can now combine both of these together (cross your fingers!!!) using the concat method. We can check that they both made it into the combined DataFrame by checking the first level of the index, and we see they're both there. We also save all the data we've scraped and cleaned to disk --- the resulting file takes up just under 5 MB.
In [321]:
clinton_df = pd.concat([bc_rv_df,hrc_rv_df])
print clinton_df.index.levels[0]
print "There are a total of {0} revisions across both the Hillary and Bill Clinton articles.".format(len(clinton_df))
clinton_df.to_csv('clinton_revisions.csv',encoding='utf8')
We are going to use these data to create a coauthorship network of all the editors who contributed to both these articles. If we've already crawled this data, we can just load it from disk, specifying options to make sure we have the right encoding, the columns are properly indexed, and the dates are parsed.
In [41]:
clinton_df = pd.read_csv('clinton_revisions.csv',
encoding='utf8',
index_col=['page','revision'],
parse_dates=['timestamp','date'])
clinton_df.head()
Out[41]:
We want to create an "edgelist" that contains all the (editor, article) pairs recording who contributed to which articles. This could be done by looping over the list, but that is inefficient on larger datasets like the one we crawled. Instead, we'll use a groupby approach to not only count the number of times an editor contributed to an article (the weight we defined previously), but a whole host of other potentially interesting attributes.
We use the agg method on the data that's been grouped by page and user to aggregate the information into nice summary statistics. We count the number of revisions using len and relabel this variable weight. For the timestamp, diff, latency, and revision variables, we compute new summary statistics for the minimum, median, and maximum values. This operation returns a new DataFrame, indexed by (page, user), with columns corresponding to labels like weight, ts_min, etc. Each row in this DataFrame will become edge attributes in the graph object we make below. The operation also creates an awkward MultiIndex on the columns, so we drop the redundant 0-level to get back to a single set of concise column names.
We're going to do something different with the timestamp data because these values are stored as Timestamp objects that don't always play nicely with other functions. Instead, we're going to convert them to counts of the amount of time (in days) since January 16, 2001, the date that Wikipedia was founded. In effect, we're counting how "old" Wikipedia was when an action occurred, and this float count will work better in subsequent steps.
In [43]:
clinton_gb_edge = clinton_df.reset_index().groupby(['page','user'])
clinton_edgelist = clinton_gb_edge.agg({'revid':{'weight':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_max':np.max}
})
# Drop the legacy/redundant column names
clinton_edgelist.columns = clinton_edgelist.columns.droplevel(0)
# Convert the ts_min and ts_max to floats for the number of days since Wikipedia was founded
clinton_edgelist['ts_min'] = (clinton_edgelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_edgelist['ts_max'] = (clinton_edgelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_edgelist.head()
Out[43]:
The nodes in this bipartite network also have attributes we can extract from the data. Remember, because this is a bipartite network, we'll need to generate attribute data for both the users and the pages. We can perform an analogous groupby
operation as we used above, but simply group on either the user
or the page
values. After each of these groupby
operations, we can perform similar agg
operations to aggregate the data into summary statistics. In the case of the user, these summary statistics are across all articles in the data. Thus the clinton_usernodelist
summarizes how many total edits a user made, their first and last observed edits, and the distribution of their diff
, latency
, and revision
statistics. The clinton_pagenodelist
summarizes how many total edits were made to the page, the date of the first and last edit, and so on.
In [44]:
# Create the usernodelist by grouping on user and aggregating
clinton_gb_user = clinton_df.reset_index().groupby(['user'])
clinton_usernodelist = clinton_gb_user.agg({'revid':{'revisions':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_median':np.median,'revision_max':np.max}
})
# Clean up the columns and convert the timestamps to counts
clinton_usernodelist.columns = clinton_usernodelist.columns.droplevel(0)
clinton_usernodelist['ts_min'] = (clinton_usernodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_usernodelist['ts_max'] = (clinton_usernodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
# Create the pagenodelist by grouping on page and aggregating
clinton_gb_page = clinton_df.reset_index().groupby(['page'])
clinton_pagenodelist = clinton_gb_page.agg({'revid':{'revisions':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_median':np.median,'revision_max':np.max}
})
# Clean up the columns and convert the timestamps to counts
clinton_pagenodelist.columns = clinton_pagenodelist.columns.droplevel(0)
clinton_pagenodelist['ts_min'] = (clinton_pagenodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_pagenodelist['ts_max'] = (clinton_pagenodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
clinton_pagenodelist.head()
Out[44]:
Now that we've created all this rich contextual data about edges, pages, and editors, we can load it all into a NetworkX
DiGraph
object called clinton_g
. We start by looping over the index in the clinton_edgelist
dataframe that corresponds to the edges in the network, convert the edge attributes to a dictionary for NetworkX
to better digest, and then add this edge and all its data to the clinton_g
graph object. This creates placeholder nodes, but we want to add the rich node data we created above as well. We can loop over the clinton_usernodelist
, convert the node attributes to a dictionary, and then overwrite the placeholder nodes by adding the data-rich user nodes to the clinton_g
graph object. We do the same for the clinton_pagenodelist
, then check the number of nodes and edges in the network, and finally print out a few examples of the data-rich nodes and edges.
In [45]:
clinton_g = nx.DiGraph()
# Add the edges and edge attributes
for (article,editor) in iter(clinton_edgelist.index.values):
    edge_attributes = dict(clinton_edgelist.ix[(article,editor)])
    clinton_g.add_edge(editor,article,edge_attributes)
# Add the user nodes and attributes
for node in iter(clinton_usernodelist.index):
    node_attributes = dict(clinton_usernodelist.ix[node])
    clinton_g.add_node(node,node_attributes)
# Add the page nodes and attributes
for node in iter(clinton_pagenodelist.index):
    node_attributes = dict(clinton_pagenodelist.ix[node])
    clinton_g.add_node(node,node_attributes)
print "There are {0} nodes and {1} edges in the network.".format(clinton_g.number_of_nodes(),clinton_g.number_of_edges())
clinton_g.edges(data=True)[:3]
Out[45]:
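As a quick sanity check (a sketch, not part of the original analysis), the in-degree of each article node in this bipartite graph should equal the number of distinct editors we aggregated for it, since every edge points from an editor to an article:
# In-degree of each article node = number of distinct editors who touched it
clinton_g.in_degree([u'Bill Clinton', u'Hillary Rodham Clinton'])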
Now it's time to do a really audacious data scrape. We're going to get the revision histories for all 2,646 articles linked to and from Hillary's article. The data will be stored in the dataframe_dict
dictionary that will be keyed by article title and the values will be the dataframes themselves. We'll use a for
loop to go over every article in the all_links
and call the get_revision_df
function we defined and tested above to get the cleaned revision DataFrame and store it in the dataframe_dict
object. Because this scrape may take a while, we're going to put in some exception handling (try, except) so that if an error occurs, we don't lose all our progress. When an exception occurs, we'll add the article name to the errors
list so we can go back and check what happened.
We'll concatenate all these DataFrames together into a gigantic DataFrame containing all the data we've scraped and then save it. This is a 485 MB file!
This will take a long time and a lot of memory!!! To prevent you from accidentally executing this, the block below is in a "raw" format that you'll need to convert to "Code" from the dropdown above.
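The raw block itself isn't rendered here, but based on the description above it would look roughly like the sketch below (the gigantic_df.csv filename matches the file loaded later; treat this as a reconstruction rather than the author's original code):
# Sketch of the raw scrape block -- convert to Code only if you really mean to run it
dataframe_dict = {}
errors = []
for article in all_links:
    try:
        dataframe_dict[article] = get_revision_df(article)
    except Exception:
        errors.append(article) # note the article and keep going
# Concatenate everything into one gigantic DataFrame and save it to disk
gigantic_df = pd.concat(dataframe_dict.values())
gigantic_df.to_csv('gigantic_df.csv',encoding='utf8')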
And there are nearly 3 million revisions in the dataset!
In [338]:
len(gigantic_df)
Out[338]:
The analysis can start again here by loading the CSV file rather than having to re-scrape the data from above. Loading the file to gigantic_df
, there are a few rows that seem to be broken, so we'll use drop
to remove them. We also use to_datetime
to make sure the timestamp information is using the appropriate units.
In [46]:
gigantic_df = pd.read_csv('gigantic_df.csv',
encoding='utf8',
index_col=['page','revision'],
parse_dates=['timestamp','date']
)
gigantic_df = gigantic_df.drop(("[[History of the United States]] at [[History of the United States#British colonization|British Colonization]]. ([[WP:TW|TW]])",589285361))
gigantic_df = gigantic_df.drop(("United States",32868))
gigantic_df['timestamp'] = pd.to_datetime(gigantic_df['timestamp'],unit='s')
gigantic_df['date'] = pd.to_datetime(gigantic_df['date'],unit='d')
gigantic_df.head()
Out[46]:
Now do all the groupby
and agg
operations to create the edgelists and nodelists we'll need to make a network as well as the data cleanup steps we did above.
In [47]:
edge_agg_function = {'revid':{'weight':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_max':np.max}
}
# Create the edgelist by grouping on both page and user
gigantic_gb_edge = gigantic_df.reset_index().groupby(['page','user'])
gigantic_edgelist = gigantic_gb_edge.agg(edge_agg_function)
# Drop the legacy/redundant column names
gigantic_edgelist.columns = gigantic_edgelist.columns.droplevel(0)
# Convert the ts_min and ts_max to floats for the number of days since Wikipedia was founded
gigantic_edgelist['ts_min'] = (gigantic_edgelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_edgelist['ts_max'] = (gigantic_edgelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
print "There are {0} edges in the network.".format(len(gigantic_edgelist))
In [48]:
node_agg_function = {'revid':{'revisions':len},
'timestamp':{'ts_min':np.min,'ts_max':np.max},
'diff':{'diff_min':np.min,'diff_median':np.median,'diff_max':np.max},
'latency':{'latency_min':np.min,'latency_median':np.median,'latency_max':np.max},
'revision':{'revision_min':np.min,'revision_median':np.median,'revision_max':np.max}
}
# Create the usernodelist by grouping on user and aggregating
gigantic_gb_user = gigantic_df.reset_index().groupby(['user'])
gigantic_usernodelist = gigantic_gb_user.agg(node_agg_function)
# Clean up the columns and convert the timestamps to counts
gigantic_usernodelist.columns = gigantic_usernodelist.columns.droplevel(0)
gigantic_usernodelist['ts_min'] = (gigantic_usernodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_usernodelist['ts_max'] = (gigantic_usernodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
print "There are {0} editor nodes in the network.".format(len(gigantic_usernodelist))
gigantic_usernodelist.head()
Out[48]:
In [49]:
# Create the pagenodelist by grouping on page and aggregating
gigantic_gb_page = gigantic_df.reset_index().groupby(['page'])
gigantic_pagenodelist = gigantic_gb_page.agg(node_agg_function)
# Clean up the columns and convert the timestamps to counts
gigantic_pagenodelist.columns = gigantic_pagenodelist.columns.droplevel(0)
gigantic_pagenodelist['ts_min'] = (gigantic_pagenodelist['ts_min'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
gigantic_pagenodelist['ts_max'] = (gigantic_pagenodelist['ts_max'] - pd.Timestamp('2001-1-16'))/np.timedelta64(1,'D')
print "There are {0} page nodes in the network.".format(len(gigantic_pagenodelist))
gigantic_pagenodelist.head()
Out[49]:
Having created the edge and node lists in the previous step, we can now add these data to a NetworkX DiGraph object we'll call gigantic_g. As before, we add the edges and edge attributes from gigantic_edgelist and then add the nodes and node attributes from gigantic_usernodelist and gigantic_pagenodelist. We use a dictionary comprehension to convert the attribute values to the float data type rather than numpy.float64, which doesn't play nicely with the graph writing functions in NetworkX. And then we can do the "grand reveal" to describe the coauthorship network of the articles in the hyperlink network neighborhood of Hillary's article.
In [70]:
gigantic_g = nx.DiGraph()
# Add the edges and edge attributes
for (article,editor) in iter(gigantic_edgelist.index.values):
    edge_attributes = dict(gigantic_edgelist.ix[(article,editor)])
    edge_attributes = {k:float(v) for k,v in edge_attributes.iteritems()}
    gigantic_g.add_edge(editor,article,edge_attributes)
# Add the user nodes and attributes
for node in iter(gigantic_usernodelist.index):
    node_attributes = dict(gigantic_usernodelist.ix[node])
    node_attributes = {k:float(v) for k,v in node_attributes.iteritems()}
    gigantic_g.add_node(node,node_attributes)
# Add the page nodes and attributes
for node in iter(gigantic_pagenodelist.index):
    node_attributes = dict(gigantic_pagenodelist.ix[node])
    node_attributes = {k:float(v) for k,v in node_attributes.iteritems()}
    gigantic_g.add_node(node,node_attributes)
print "There are {0} nodes and {1} edges in the network.".format(gigantic_g.number_of_nodes(),gigantic_g.number_of_edges())
gigantic_g.edges(data=True)[:3]
Out[70]:
Finally, having gone through all this effort to make a co-authorship network with such rich attributes and complex properties, we should save our work. There are many different file formats for storing network objects to disk, but the two I use the most are "graphml" and "gexf". They do slightly different things, but they're generally interoperable and compatible with many programs for visualizing networks like Gephi.
In [71]:
nx.write_graphml(gigantic_g,'gigantic_g.graphml')
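If you prefer the gexf format mentioned above, NetworkX has an analogous writer:
# Write the same graph in gexf format as well (also readable by Gephi)
nx.write_gexf(gigantic_g,'gigantic_g.gexf')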
Now let's perform some basic network analyses on this gigantic graph we've created. An extremely easy and important metric to compute is the degree centrality of nodes in the network: how well-connected a node is based on the number of edges it has to other nodes. We use the directed nature of the edges to distinguish between articles (which receive links in) and editors (which send links out) and compute the in- and out-degree centralities respectively with the nx.in_degree_centrality and nx.out_degree_centrality functions. These functions return a normalized degree centrality, where the values aren't the integer count of connected edges but rather the fraction of other nodes to which a node is connected. The values are returned in dictionaries keyed by node ID (article title or user name), which we save as g_idc and g_odc.
In [51]:
g_idc = nx.in_degree_centrality(gigantic_g)
g_odc = nx.out_degree_centrality(gigantic_g)
We can use a fancy bit of programming called itemgetter to quickly sort these dictionaries and return the 10 best-connected articles and users. Hillary, despite being the central node we started at, is not actually the best-connected article; other major people and entities are. The top editors, interestingly enough, are not actually people but automated bots who perform a variety of maintenance and cleanup tasks across articles.
In [58]:
sorted(g_idc.iteritems(), key=itemgetter(1),reverse=True)[:10]
Out[58]:
In [53]:
sorted(g_odc.iteritems(), key=itemgetter(1),reverse=True)[:10]
Out[53]:
We can plot a histogram of connectivity patterns for the articles and editors, which shows a very skewed distribution: most editors edit only a single article while there are single editors who make thousands of contributions. The distribution for articles shows a less severe but still very long-tailed distribution of contribution patterns.
In [54]:
g_size = gigantic_g.number_of_nodes()
g_idc_counter = Counter([v*(g_size-1) for v in g_idc.itervalues() if v != 0])
g_odc_counter = Counter([v*(g_size-1) for v in g_odc.itervalues() if v != 0])
plt.scatter(g_idc_counter.keys(),g_idc_counter.values(),s=50,c='b',label='Articles')
plt.scatter(g_odc_counter.keys(),g_odc_counter.values(),s=50,c='r',label='Editors')
plt.yscale('log')
plt.xscale('log')
plt.xlabel('Number of connections',fontsize=15)
plt.ylabel('Number of nodes',fontsize=15)
plt.legend(loc='upper right',scatterpoints=1)
Out[54]:
We can also look at the distribution of edge weights, or the number of times that an editor contributed to an article. We could do this using the gigantic_edgelist DataFrame, but let's practice using the data attributes we've stored in the graph object. Using a list comprehension as before, we iterate over the edges (note the use of edges_iter(data=True), which is more memory efficient and returns the edge attributes), each of which comes back as a tuple (i, j, attributes_dict). We access each tuple's weight and store it in the list weights, proceed with a Counter operation, and then plot the results.
In [55]:
weights = [attributes['weight'] for i,j,attributes in gigantic_g.edges_iter(data=True)]
weight_counter = Counter(weights)
plt.scatter(weight_counter.keys(),weight_counter.values(),s=50,c='b',label='Weights')
plt.yscale('log')
plt.xscale('log')
plt.xlabel('Number of contributions',fontsize=15)
plt.ylabel('Number of edges',fontsize=15)
Out[55]:
We can compute another degree-related metric called "assortativity" that measures how well connected your neighbors are on average. We compute this statistic on the set of article and editor nodes using the nx.assortativity.average_degree_connectivity
function with special attention to the direction of the ties as well as limiting the nodes to those in the set of pages or users, respectively. Plotting the distribution, both the articles and the editors exhibit negative correlations. In other words, for those editors (articles) connected with few articles (editors), those articles (editors) have a tendency to be well-connected to other nodes in the set. Conversely, for those editors (articles) connected with many articles (editors), those articles (editors) have a tendency to be poorly connected to other nodes in the set. Articles exhibit a stronger correlation than editors.
In [56]:
article_nn_degree = nx.assortativity.average_degree_connectivity(gigantic_g,source='in',target='out',nodes=gigantic_pagenodelist.index)
editor_nn_degree = nx.assortativity.average_degree_connectivity(gigantic_g,source='out',target='in',nodes=gigantic_usernodelist.index)
plt.scatter(article_nn_degree.keys(),article_nn_degree.values(),s=50,c='b',label='Articles',alpha=.5)
plt.scatter(editor_nn_degree.keys(),editor_nn_degree.values(),s=50,c='r',label='Editors',alpha=.5)
plt.yscale('log')
plt.xscale('log')
plt.xlabel('Degree',fontsize=15)
plt.ylabel('Average neighbor degree',fontsize=15)
plt.legend(loc='upper right',scatterpoints=1)
Out[56]:
Just trying a few other things: the cells below check whether an edge's weight is related to the difference in degree centrality between its article and editor endpoints.
In [72]:
gigantic_g.edges(data=True)[1]
Out[72]:
In [75]:
edge_weight_centrality = [list(),list()]
for (i,j,attributes) in gigantic_g.edges_iter(data=True):
    edge_weight_centrality[0].append(g_idc[j] - g_idc[i])
    edge_weight_centrality[1].append(attributes['weight'])
In [77]:
plt.scatter(edge_weight_centrality[0],edge_weight_centrality[1])
plt.yscale('log')
Throughout this section, we've gotten the links for a single article at a time. As we did with the user information in the previous section, we can wrap these queries in a function so that they're easier to run. Once we do this, we can do more interesting things like examine the hyperlink ego-network surrounding a single article.
We take the lists of linked articles we extracted from Hillary's article and iterate over them, getting the lists of links for each of them. We need to place the output into a larger data object that will hold everything. I'll use a dictionary keyed by article name that returns a dictionary containing the lists of links for that article. We'll put Hillary's data in there to start it up, and then add more.
Next we come up with the list of articles we're going to iterate over. We could just add the outlist and inlist articles together, but there might be redundancies in there. Instead we'll cast these lists into sets containing only unique article names, and the union of these sets creates a master set of all unique article names. Then we convert this joined set back into a list called all_links so we can iterate over it.
This set of unique links has 2,646 articles in it, which will take some time to scrape. This may take over an hour to run and will generate ~190MB of data: convert the cell below back to "Code" if you really want to execute it.
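The scrape cell itself is kept out of the executable flow for that reason, but a sketch of what it might contain, based on the description above, follows (link_dict is a hypothetical name for the dictionary keyed by article):
# Sketch only: build the master list of unique neighboring articles and crawl their links
all_links = list(set(hrc_alllink_outlist) | set(hrc_alllink_inlist))
link_dict = {u'Hillary Rodham Clinton': {'In': hrc_alllink_inlist,
                                         'Out': hrc_alllink_outlist}}
for article in all_links:
    try:
        out_links, in_links = get_article_links(article) # one pair of queries per article
        link_dict[article] = {'In': in_links, 'Out': out_links}
    except Exception:
        print u"Problem getting links for {0}".format(article)
The dtype_dict defined in the cell below maps the revision DataFrame's columns to their expected types; presumably it is meant for enforcing those types when the scraped revision data is re-loaded from disk.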
In [ ]:
dtype_dict = {'page':unicode,
'revision':np.int64,
'anon':bool,
'comment':unicode,
'parentid':np.int64,
'size':np.int64,
'timestamp':unicode,
'user':unicode,
'userid':np.int64,
'unique_users':np.int64,
'date':unicode,
'diff':np.float64,
'latency':np.float64
}